Abstract

We present a new document retrieval approach combining relevance feedback,
pseudo-relevance feedback, and Markov random field modeling of term
interaction. Overall effectiveness of our combined model and the relative
contribution from each component are evaluated on the GOV2 webpage
collection. Given 0-5 feedback documents, we find each component contributes
unique value to the overall ensemble, achieving significant improvement
individually and in combination. Comparative evaluation in the 2008 TREC
Relevance Feedback track further shows our complete system typically
performs as well as or better than peer systems.

Introduction 

User queries can be understood as surrogates for underlying information
needs. While we might assume the information needs are fairly well-defined,
the corresponding queries are often terse and incomplete. Consequently,
performing retrieval strictly on the basis of an observed query often yields
low retrieval accuracy and especially poor recall. A common strategy for
addressing this is to infer additional details regarding the information
need given a set of documents either known or thought to be relevant. When
the user provides one or more such feedback documents in addition to the
query, we have the scenario known as relevance feedback (RF).

This paper presents a strategy for effectively leveraging varying amounts of
feedback (documents): none (a.k.a. ad hoc retrieval), one, a few, or many.
One technique we employ, pseudo-relevance feedback (PRF), automatically
induces additional feedback documents and uses them to further expand the
query (Lavrenko & Croft 2001; Zhai & Lafferty 2001). Although PRF has been
primarily investigated with ad hoc retrieval, it has the potential for
greater effectiveness in the RF setting since explicit feedback improves
system ranking for automatically identifying related documents. Alongside
PRF, we also investigate the benefit of modeling term interactions in the RF
scenario. Specifically, we adopt Markov random field (MRF) modeling of
sequential dependencies between terms (Metzler & Croft 2005).

*An  earlier  version  of  this  paper  appeared  in  the  TREC 
2008  Conference  Notebook. 


Given these two techniques, PRF and MRF modeling, we evaluate the benefit
from applying each individually and in combination across varying RF
conditions. Given 0-5 feedback documents, we find each component contributes
unique value to the overall ensemble, achieving significant improvement
individually and in combination. Additional experiments using RF in the
absence of MRF or PRF yield results consistent with community wisdom that a
little feedback can make a big difference. Finally, comparative evaluation
of our complete system in the 2008 TREC Relevance Feedback track shows our
approach typically performs as well as or better than peer systems.

Method 

This section describes our overall approach. After briefly summarizing our
combined model, we proceed to review the individual techniques employed:
query-likelihood (Lafferty & Zhai 2001), relevance and pseudo-relevance
feedback (Lavrenko & Croft 2001), and Markov random field modeling of
sequential term dependencies (Metzler & Croft 2005).

Model  Summary 

Given  an  input  query  Q  and  feedback  documents  F, 
our  overall  method  may  be  summarized  as  follows: 

0. Unigram document models $\Theta^D$ are estimated for each document via
Dirichlet smoothing (Equation 3)

1. A unigram query model $\Theta^Q$ is estimated from Q via
maximum-likelihood (Equation 2)

2. A unigram RF model $\Theta^F$ is estimated as the average document model
over the set of positive (i.e. relevant) feedback documents (Equation 4)

3. An improved unigram query model $\Theta^{Q'}$ is produced by linearly
mixing the $\Theta^Q$ and $\Theta^F$ models (Equation 6)

4. $\Theta^{Q'}$ is used as the unigram component $f_T$ in the MRF model to
yield $P'_\Lambda(D|Q)$ (Equation 11)

5. A unigram pseudo-relevance model $\Theta^P$ is estimated based on
$P'_\Lambda(D|Q)$ (Equation 12)

6. The PRF unigram likelihood $\Theta^P \cdot \log \Theta^D$ is linearly
mixed with the $P'_\Lambda(D|Q)$ MRF model (Equation 14)

Query-Likelihood

We adopt the query-likelihood (Ponte & Croft 1998) paradigm for information
retrieval. In this language model (LM) approach, we assume each observed
document D (of $|D|$ words) is generated by an underlying LM parameterized
by $\Theta^D$ (the document model). Given an input query Q (of $|Q|$ words),
we infer D's relevance to Q as the probability of observing Q as a random
sample drawn from $\Theta^D$. Assuming bag-of-words, $\Theta^D$ specifies a
unigram distribution $\{\theta^D_{w_1} \ldots \theta^D_{w_N}\}$ over the
collection vocabulary $V = \{w_1 \ldots w_N\}$. Finally, letting $f^Q_w$
denote the frequency of word w in Q, query-likelihood can be expressed in
log form as:

$$\log p(Q|D) = \sum_{w \in Q} f^Q_w \log \theta^D_w = f^Q \cdot \log \Theta^D \tag{1}$$

where the final dot product is taken over the entire collection vocabulary
(equivalent since $f^Q_w = 0$ for all terms not observed in the query).
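
As a minimal sketch of Equation 1, the following computes the log
query-likelihood from query term counts and a smoothed document model. The
function name `log_query_likelihood` and the toy probabilities are
illustrative assumptions, not part of our system.

```python
import math
from collections import Counter

def log_query_likelihood(query_terms, doc_model):
    """Equation 1: sum over query words of f_w^Q * log(theta_w^D).

    doc_model maps each word to a smoothed (non-zero) probability.
    """
    query_counts = Counter(query_terms)  # f^Q: term frequencies in the query
    return sum(count * math.log(doc_model[word])
               for word, count in query_counts.items())

# Toy document model over a four-word vocabulary (assumed values).
doc_model = {"markov": 0.4, "random": 0.3, "field": 0.2, "query": 0.1}
score = log_query_likelihood(["markov", "field", "markov"], doc_model)
```

Repeating "markov" in the query doubles its contribution, which is the only
way this formulation can express relative term importance.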

While this formulation of query-likelihood is perfectly valid, incorporating
lexical statistics from feedback documents into it is cumbersome since the
relative importance of terms can only be expressed through repetition. To
address this, Equation 1 can be generalized by assuming the observed Q is
merely representative of a latent query model parameterized by
$\Theta^Q = \{\theta^Q_{w_1} \ldots \theta^Q_{w_N}\}$, consistent with
intuition that the underlying information need might be verbalized in other
ways besides Q. Query likelihood may then be re-expressed in terms of
$\Theta^Q$'s maximum-likelihood (ML) estimate
$\hat{\Theta}^Q = \frac{1}{|Q|} f^Q$:

$$f^Q \cdot \log \Theta^D = |Q| \, \hat{\Theta}^Q \cdot \log \Theta^D \stackrel{rank}{=} -\mathcal{D}(\hat{\Theta}^Q \,\|\, \Theta^D) \tag{2}$$


This shows inferring document relevance on the basis of $P(Q|D)$ is
equivalent to ranking according to minimal KL-divergence
$\mathcal{D}(\Theta^Q \| \Theta^D)$ when $\Theta^Q$ is estimated by ML
(Lafferty & Zhai 2001). Intuitively, better retrieval can be achieved by
forgoing strict equivalence with Equation 1 and instead seeking more
accurate inference of $\Theta^Q$. This is where relevance feedback fits in:
it can be leveraged in conjunction with the observed query to better
estimate $\Theta^Q$.
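
The rank equivalence in Equation 2 can be checked numerically. The sketch
below, using made-up distributions, ranks three toy document models once by
expected log-likelihood under the query model and once by KL-divergence; the
helper names are assumptions for illustration only.

```python
import math

def expected_log_likelihood(query_model, doc_model):
    """Theta^Q . log Theta^D, skipping zero-probability query terms."""
    return sum(p * math.log(doc_model[w])
               for w, p in query_model.items() if p > 0)

def kl_divergence(p, q):
    """D(p || q) for distributions given as word -> probability dicts."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

query_model = {"a": 0.5, "b": 0.5, "c": 0.0}
docs = {
    "d1": {"a": 0.6, "b": 0.3, "c": 0.1},
    "d2": {"a": 0.2, "b": 0.2, "c": 0.6},
    "d3": {"a": 0.4, "b": 0.5, "c": 0.1},
}

# Highest likelihood first versus lowest divergence first.
by_likelihood = sorted(
    docs, key=lambda d: expected_log_likelihood(query_model, docs[d]),
    reverse=True)
by_divergence = sorted(docs, key=lambda d: kl_divergence(query_model, docs[d]))
```

The two orderings agree because the divergence differs from the negated
likelihood only by the query model's entropy, which is constant across
documents.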

Regarding $\Theta^D$, we apply standard Dirichlet smoothing to estimate it
as a mixture between document D and collection C (of $|C|$ words) ML
estimates (Zhai & Lafferty 2004; Zaragoza, Hiemstra, & Tipping 2003; Lease &
Charniak 2008):

$$\Theta^D = \lambda \frac{f^D}{|D|} + (1 - \lambda) \frac{f^C}{|C|}, \qquad \lambda = \frac{|D|}{|D| + \mu} \tag{3}$$

where $\mu$ specifies hyper-parameter strength of the prior.
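
A minimal sketch of Dirichlet smoothing follows. It uses the algebraically
equivalent form $(f^D_w + \mu \, f^C_w / |C|) / (|D| + \mu)$; the function
name and the toy counts are assumptions, while the default prior strength of
1700 is the unigram value listed in Table 1.

```python
from collections import Counter

def dirichlet_document_model(doc_terms, collection_counts, mu=1700.0):
    """Dirichlet-smoothed unigram model over the collection vocabulary.

    collection_counts maps each vocabulary word to its collection frequency;
    every document term is assumed to appear in it.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    coll_len = sum(collection_counts.values())
    return {
        word: (doc_counts[word] + mu * coll_freq / coll_len) / (doc_len + mu)
        for word, coll_freq in collection_counts.items()
    }

# Toy collection statistics and a three-word document (assumed values).
collection_counts = {"a": 6, "b": 3, "c": 1}
model = dirichlet_document_model(["a", "b", "b"], collection_counts, mu=2.0)
```

Words absent from the document (here "c") still receive non-zero
probability from the collection, which keeps the logarithm in Equation 1
finite.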


Relevance  Feedback 

Given a query, our retrieval model (Equation 2) infers relevance on the
basis of similarity between (our estimates of) query and document models,
$\Theta^Q$ and $\Theta^D$. While we have thus far focused on document
ranking for a given query, let us now consider the other direction of query
formulation. Given a set of relevant documents $\mathcal{R}$ that match a
user's information need, the optimal query model $\Theta^Q$ under Equation 2
will exhibit greater similarity to $\mathcal{R}$'s latent document models
$\forall_{D \in \mathcal{R}} \Theta^D$ than those of other documents. This
suggests that given partial knowledge of $\mathcal{R}$ in the form of
$|\mathcal{F}|$ feedback documents where
$\mathcal{F} \subseteq \mathcal{R}$, $\Theta^Q$ might be estimated on the
basis of similarity to $\mathcal{F}$. For example, a simple idea would be to
estimate $\Theta^Q$ as the average document model over the set of positive
(i.e. relevant) feedback documents:

$$\Theta^F = \frac{1}{|\mathcal{F}|} \sum_{D \in \mathcal{F}} \Theta^D \tag{4}$$

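While the classic Rocchio method (Rocchio & others 1971) also incorporates
negative feedback ($\gamma$ term), shown here in its standard form with
$D_r$ and $D_n$ denoting the relevant and non-relevant feedback documents:

$$q' = \alpha \, q_0 + \frac{\beta}{|D_r|} \sum_{d \in D_r} d - \frac{\gamma}{|D_n|} \sum_{d \in D_n} d \tag{5}$$

negative feedback has typically been found to be far less useful than
positive feedback, and so we omit it completely in our system. Since
retrieval time is typically proportional to the number of terms used, a
common efficiency heuristic is to approximate $\Theta^F$ by its $k_F$ most
likely terms and re-normalize.$^1$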

Although the approach in Equation 4 does provide broader lexical coverage of
$\mathcal{R}$ than available in the original query string, it suffers from a
different problem. Whereas Q tends to closely focus on the core information
need, the average feedback document model may diverge from it since
documents in $\mathcal{F}$ likely discuss many topics. Rocchio's
$\alpha \, q_0$ mixing term helps prevent such drift, and we adopt the same
solution here by inferring $\Theta^{Q'}$ on the basis of both the original
query and the feedback documents in the form of a linear mixture:

$$\Theta^{Q'} = (1 - \lambda_F) \, \Theta^Q + \lambda_F \, \Theta^F \tag{6}$$

Despite the simplicity of this approach, recent studies have shown it
comparable to more sophisticated strategies (Balog, Weerkamp, & de Rijke
2008; Yi & Allan 2008). Consequently, we adopt it here in our work.
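
The feedback steps of Equations 4 and 6, together with the top-$k_F$
truncation heuristic, might be sketched as follows. All function names and
the toy models are illustrative assumptions rather than our actual
implementation.

```python
def average_model(doc_models):
    """Equation 4: average the document models of the feedback set."""
    vocab = set().union(*doc_models)
    n = len(doc_models)
    return {w: sum(m.get(w, 0.0) for m in doc_models) / n for w in vocab}

def truncate_and_renormalize(model, k):
    """Keep the k most likely terms and rescale them to sum to one."""
    top = sorted(model.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}

def mix_models(query_model, feedback_model, lam_f):
    """Equation 6: linear mixture of query and feedback models."""
    vocab = set(query_model) | set(feedback_model)
    return {w: (1 - lam_f) * query_model.get(w, 0.0)
               + lam_f * feedback_model.get(w, 0.0)
            for w in vocab}

# Two toy feedback document models and a one-word query (assumed values).
feedback = average_model([{"a": 0.5, "b": 0.5}, {"a": 0.3, "c": 0.7}])
feedback_top2 = truncate_and_renormalize(feedback, k=2)
mixed = mix_models({"a": 1.0}, feedback_top2, lam_f=0.3)
```

Re-normalizing after truncation keeps the effective mixture weight equal to
the chosen $\lambda_F$ regardless of how many terms are retained.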

Combining Equations 1, 2, and 6, we see that unigram feedback can be
equivalently interpreted as a mixture of query models used in the original
ranking function (Equation 1) or as a mixture of ranking functions:

$$\begin{aligned}
P(Q|D) &\stackrel{rank}{=} \log \Theta^D \cdot \Theta^{Q'} \\
&= \log \Theta^D \cdot [(1 - \lambda_F) \, \Theta^Q + \lambda_F \, \Theta^F] \\
&= (1 - \lambda_F) [\log \Theta^D \cdot \Theta^Q] + \lambda_F [\log \Theta^D \cdot \Theta^F] \\
&\stackrel{rank}{=} -(1 - \lambda_F) \, \mathcal{D}(\Theta^Q \| \Theta^D) - \lambda_F \, \mathcal{D}(\Theta^F \| \Theta^D)
\end{aligned}$$

However, once we move away from unigram modeling to perform MRF modeling
instead, we will see that this dual interpretation is no longer applicable.

1 Since Equation 2 is a linear model, ranking is invariant under any scaling
of the weight vector and so normalization does not affect ranking. However,
if we wish to later use $\Theta^F$ in some mixture model, choice of $k_F$
will have a side-effect on mixture weight unless normalization is performed.


The  Markov  Random  Field  Model 

The Markov random field (MRF) approach (Metzler & Croft 2005) models the
joint distribution $P_\Lambda(Q, D)$ over queries Q and documents D. It is
constructed from a graph G consisting of a document node and nodes for each
query term. Nodes in the graph represent random variables and edges define
the independence semantics between the variables. In particular, a random
variable in the graph is independent of its non-neighbors given observed
values for its neighbors. Therefore, different edge configurations impose
different independence assumptions. The joint distribution over the random
variables in G is defined by:

$$P_\Lambda(Q, D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c; \Lambda) \tag{7}$$

where $C(G)$ is the set of cliques in G, each $\psi(\cdot; \Lambda)$ is a
non-negative potential function over clique configurations parameterized by
$\Lambda$, and
$Z_\Lambda = \sum_{Q,D} \prod_{c \in C(G)} \psi(c; \Lambda)$ computes the
partition function. For document ranking, we can skip the expensive
computation of $Z_\Lambda$ and simply score each document D by its
unnormalized joint probability with Q under the MRF. If we define our
potential functions as $\psi(c; \Lambda) = \exp[\lambda_c f(c)]$, where
$f(c)$ is some real-valued feature function over clique values and
$\lambda_c$ is that feature function's assigned weight, the posterior
$P_\Lambda(D|Q)$ is computed as:

$$P_\Lambda(D|Q) = \frac{P_\Lambda(Q, D)}{P_\Lambda(Q)} \stackrel{rank}{=} \sum_{c \in C(G)} \log \psi(c; \Lambda) = \sum_{c \in C(G)} \lambda_c f(c) \tag{8}$$


The graph G can be constructed in various ways depending on the assumptions
made regarding independence between terms. In the case of full independence,
query term nodes share an edge with the document only. With sequential
dependence, adjacent terms in the query share an additional edge in G.
Finally, assuming full dependence constructs an edge between each pair of
query term nodes. The choice of graph structure determines the set of
cliques present in G and thereby the set of features used in ranking. We use
the sequential dependence MRF in our work since the full dependence model is
expensive to compute due to its combinatorial feature growth and provides
only slight improvement in accuracy (Metzler & Croft 2005).

All of the potential functions used in the MRF can be expressed in the
following generic form:

$$\log \psi_i(c; \Lambda) = \lambda_i \log \left[ (1 - \alpha_i) \frac{S_i(c)}{|D|} + \alpha_i \frac{S_i(c)}{|C|} \right] \tag{9}$$


where $S_i(c)$ denotes a given statistic computed for the given clique c
(over the document in the first term and over the collection in the second),
$|D|$ and $|C|$ indicate respective token counts of the document and entire
collection (statistics other than term frequency are only approximately
normalized), and $\alpha_i = \frac{\mu_i}{\mu_i + |D|}$, where $\mu_i$
denotes a smoothing hyper-parameter specific to the potential function
$\psi_i(c; \Lambda)$ (Zhai & Lafferty 2004). Note that use of term frequency
as the statistic $S_i$ computes the standard Dirichlet-smoothed unigram
(Equation 3).
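
As an illustration of the generic potential in Equation 9, the sketch below
evaluates one smoothed log-potential from a document-side and a
collection-side count. The function and argument names are assumptions, and
the counts in the example are invented.

```python
import math

def log_potential(weight, doc_stat, coll_stat, doc_len, coll_len, mu):
    """Equation 9 for one clique.

    doc_stat and coll_stat are the statistic S_i(c) counted in the document
    and in the whole collection; mu is the potential's smoothing parameter.
    """
    alpha = mu / (mu + doc_len)
    smoothed = ((1 - alpha) * doc_stat / doc_len
                + alpha * coll_stat / coll_len)
    return weight * math.log(smoothed)

# A term occurring twice in a 100-token document and 50 times in a
# 10,000-token collection, with mu = 100 (all values assumed).
value = log_potential(1.0, doc_stat=2, coll_stat=50,
                      doc_len=100, coll_len=10000, mu=100.0)
```

With term frequency as the statistic this reproduces the Dirichlet-smoothed
unigram probability; phrase and window counts plug into the same function
unchanged.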

Potential functions are primarily distinguished by the particular statistic
$S_i$ they employ. The MRF model exploits three classes of lexical features:
individual terms, contiguous phrases, and proximity. Each of these
corresponds to a distinct statistic $S_i$: term frequency, phrase frequency
(i.e. "ordered" Indri #1 operator), and frequency of a set of terms within
some parameter N-sized window (i.e. "unordered" Indri #uwN operator). The
latter two multi-term statistics' corresponding potential functions are
applicable when some form of dependency is assumed between query terms in
the graph structure. In particular, the phrasal potential function is only
applied to cliques connecting contiguous query terms, whereas the proximity
potential function is applied to all multi-term cliques, contiguous and
non-contiguous alike. This means each pair of contiguous query terms
generates a clique c whose potential function is defined by the product
$\psi_O(c) \, \psi_U(c)$ of ordered and unordered potential functions.

Using these three classes of potential functions, the MRF can be expressed
as a three-component mixture model computed over term, phrase, and proximity
feature classes. Omitting clique parameterization and computation of the
partition function, we can see that each class effectively computes its own
ranking function which is then mixed with that of the other classes:

$$P_\Lambda(Q, D) \propto \lambda_T f_T + \lambda_O f_O + \lambda_U f_U \tag{10}$$

Note that unigram likelihood (Equation 2) can be equivalently formulated as
an MRF in which $\lambda_T = 1$ and $\lambda_O = \lambda_U = 0$. This means
an improved unigram model $\Theta^{Q'}$ (e.g. better estimated via feedback)
can be used in place of the MRF's standard $f_T$ unigram model:

$$P'_\Lambda(D|Q) \propto \lambda_T [\Theta^{Q'} \cdot \log \Theta^D] + \lambda_O f_O + \lambda_U f_U \tag{11}$$
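
To make the three feature classes concrete, the following sketch builds a
sequential dependence query string using Indri's `#weight`, `#combine`,
`#1`, and `#uwN` operators. The helper name, its default weights, and the
example query are assumptions for illustration; the window size of 8 is the
proximity setting from Table 1.

```python
def sdm_indri_query(terms, lam_t=0.8, lam_o=0.1, window=8):
    """Build an Indri query mixing term, ordered, and unordered features."""
    lam_u = 1.0 - lam_t - lam_o
    pairs = list(zip(terms, terms[1:]))  # cliques over adjacent query terms
    unigrams = " ".join(terms)
    if not pairs:
        # A single-term query has no dependence features.
        return f"#combine( {unigrams} )"
    ordered = " ".join(f"#1( {a} {b} )" for a, b in pairs)
    unordered = " ".join(f"#uw{window}( {a} {b} )" for a, b in pairs)
    return (f"#weight( {lam_t:.2f} #combine( {unigrams} ) "
            f"{lam_o:.2f} #combine( {ordered} ) "
            f"{lam_u:.2f} #combine( {unordered} ) )")

query = sdm_indri_query(["markov", "random", "field"])
```

Substituting a feedback-improved unigram model, as in Equation 11, would
amount to replacing the first `#combine` with a weighted list of expansion
terms; that step is omitted here.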

Pseudo-Relevance  Feedback 

PRF is quite similar to RF except that now we must factor in our uncertainty
regarding each feedback document's relevance to the query. While our
original setup in Equation 4 made a simplifying assumption that all feedback
documents were equally relevant, this estimate can be improved by accounting
for varying degree of relevance across the feedback set. The straightforward
way to accomplish this is to generalize from the simple average of
Equation 4 to instead compute an expectation respecting some arbitrary
estimate $p(D|Q)$ of feedback document relevance with respect to the
query Q:

$$\Theta^P = \mathbb{E}_{D \sim p(D|Q)}[\Theta^D] = \sum_{D \in C} p(D|Q) \, \Theta^D \tag{12}$$

where C denotes the document collection. Recall the MRF model defines a
joint distribution $P_\Lambda(Q, D)$ expressed unnormalized in Equation 10.
While we could compute the full partition function to normalize
$P_\Lambda(Q, D)$ over the entire document collection, this is unnecessary
unless we want to use the entire collection for feedback. Besides the large
computational cost this would incur, there is diminishing return and
increasing harm from query drift as we start sifting through lower ranks.
Instead, we can simply normalize with respect to the set of PRF documents
$\mathcal{P}$ only:

$$P_\Lambda(D|Q) = \frac{P_\Lambda(Q, D)}{\sum_{D' \in \mathcal{P}} P_\Lambda(Q, D')} \tag{13}$$


The expected PRF document model can then be easily computed by Equation 12
above. As with RF, a common efficiency heuristic is to approximate
$\Theta^P$ by its $k_P$ most likely terms and re-normalize. The original
estimate of $\Theta^Q$ is also typically mixed with $\Theta^P$, similar to
what was done with explicit feedback (Equation 6).
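
Equations 12 and 13 can be sketched as below. The sketch assumes each PRF
document comes with the logarithm of its unnormalized joint score, which is
an assumption about the input format; function names and numbers are
illustrative only.

```python
import math

def normalize_over_feedback_set(log_scores):
    """Equation 13: turn unnormalized log scores into a distribution
    over the PRF document set only."""
    peak = max(log_scores.values())  # subtract the max for numeric stability
    unnormalized = {d: math.exp(s - peak) for d, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {d: v / total for d, v in unnormalized.items()}

def expected_document_model(posteriors, doc_models):
    """Equation 12 restricted to the PRF set: posterior-weighted average."""
    model = {}
    for doc, weight in posteriors.items():
        for word, theta in doc_models[doc].items():
            model[word] = model.get(word, 0.0) + weight * theta
    return model

# Two toy PRF documents, the first scored three times higher (assumed).
posteriors = normalize_over_feedback_set(
    {"d1": math.log(3.0), "d2": math.log(1.0)})
prf_model = expected_document_model(
    posteriors, {"d1": {"a": 1.0}, "d2": {"a": 0.2, "b": 0.8}})
```

Setting all posteriors equal recovers the simple average of Equation 4.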

When using PRF in conjunction with the MRF model, we must specify how
$\Theta^P$ is mixed with the original model: query model mixing (i.e. in the
$f_T$ component) or ranking function mixing. We adopt Indri's formulation
(Metzler et al. 2005) incorporating PRF at the level of the ranking
function:

$$P_\Lambda(D|Q) = \lambda_P [\log \Theta^D \cdot \Theta^P] + (1 - \lambda_P) \, P'_\Lambda(D|Q) \tag{14}$$


using $P'_\Lambda(D|Q)$ as defined in Equation 11. Note PRF is limited here
to unigram modeling; we do not estimate dependency statistics from PRF for
revising $f_O$ and $f_U$ components since previous work has shown little
benefit from doing so (Metzler & Croft 2007a).
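
A minimal sketch of the ranking-function mixture in Equation 14 follows. It
takes the MRF score of Equation 11 as an already-computed number; the
function names and example values are assumptions for illustration.

```python
import math

def prf_unigram_score(prf_model, doc_model):
    """log Theta^D . Theta^P: PRF unigram likelihood of the document."""
    return sum(p * math.log(doc_model[w])
               for w, p in prf_model.items() if p > 0)

def combined_score(mrf_score, prf_model, doc_model, lam_p):
    """Equation 14: mix the PRF unigram score with the MRF score."""
    return (lam_p * prf_unigram_score(prf_model, doc_model)
            + (1 - lam_p) * mrf_score)

# Toy PRF and document models with an assumed MRF score of -2.0.
final = combined_score(
    mrf_score=-2.0,
    prf_model={"a": 0.5, "b": 0.5},
    doc_model={"a": 0.25, "b": 0.25},
    lam_p=0.5)
```

Because the mixture is applied to scores rather than to the query model, the
phrase and proximity components of the MRF score are down-weighted along
with its unigram component.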


Evaluation 

This section describes evaluation performed in developing and testing our
model. Table 1 provides a complete listing of all model parameters and
identifies which remain fixed in our experiments. We follow previous work in
setting MRF proximity parameters for window size $N_{proximity}$ and
Dirichlet smoothing $\mu_{proximity}$.

Component             Parameter            Value
Unigram               $\mu$                1700
Relevance Feedback    $\lambda_F$          varied
                      $k_F$                varied
MRF                   $\lambda_T$          varied
                      $\lambda_O$          varied
                      $\lambda_U$          $1 - \lambda_T - \lambda_O$
                      $N_{proximity}$      8
                      $\mu_{proximity}$    4000
Pseudo-rel Feedback   $\lambda_P$          varied
                      $k_P$                50
                      $|\mathcal{P}|$      10

Table 1: Parameters of our combined model.

Track Protocol and Metrics

Model evaluation was performed as part of our participation in the 2008 TREC
Relevance Feedback Track. A goal of the track was to establish strong
baselines for current RF techniques under varying amounts of explicit
feedback:

A: no feedback (i.e. ad hoc retrieval)

B: 1 relevant document

C: 3 relevant and 3 non-relevant documents

D: 10 judged documents

E: large amounts of feedback (40-800 documents)

Each feedback set was included as a subset of its larger successors.
Retrieval experiments were conducted on the GOV2 webpage collection
(25,205,179 documents) with 264 title-field queries drawn from topics of
2004-2006 Terabyte tracks (TREC topics 701-850) and the 2007 Million Query
track (50 and 214 topics, respectively). Documents chosen for feedback
achieved the highest median retrieval ranks in the earlier track from which
the topic was drawn using the best run submitted by participating groups.
All odd-numbered and some even-numbered Terabyte topics were excluded from
the test set and so available for model development; evaluation on test
topics was blind. Top-2500 document rankings were submitted for official
runs though reported results include top-1000 ranked documents only.

Cumulative metric performance across topics is generally computed by a
simple (arithmetic) average over per-query metric performance. The one
exception, geometric-mean average precision (gmap), adopts the geometric
mean instead in order to focus metric attention on difficult topics. Primary
metrics used were (arithmetic-mean) average precision (AP) and top-10
precision (P@10), as reported by trec_eval 8.1.$^2$ Besides gmap, we also
report R-Precision (rprec): precision after R documents retrieved, where R
is the number of relevant documents for each topic. Results marked as
significant† (p < .05), highly significant‡ (p < .01), or neither reflect
agreement between a two-sided paired t-test and random shuffling statistics
computed by Indri's ireval (Smucker, Allan, & Carterette 2007).

Experimental  Setup 

Indri (Strohman et al. 2004) formed the basis of our retrieval model. Since
Indri does not provide a facility for performing RF, however, we estimated
the feedback model $\Theta^F$ externally. Queries were stopped at query time
using a 418 word INQUERY stop list (Allan et al. 2000) and then Porter
stemmed.$^3$ Recall that term pair features $f_O$ and $f_U$ from the
dependency model (Equation 10) correspond to co-occurrence statistics
tracking pairs of words occurring consecutively or within some proximity of
one another. It is worth noting that Indri replaces stopwords with
out-of-vocabulary tokens and so use of stopwords does not affect distance
between terms in computed co-occurrence statistics.

2 http://trec.nist.gov/trec_eval

3 http://www.tartarus.org/martin/PorterStemmer


Model       A         B         C          D
Unigram     29.18     ‡30.84    ‡31.94     ‡33.49
PRF                   32.50‡    32.47      ‡34.32‡
MRF         32.04‡    32.55‡    ‡34.61‡    ‡35.62‡
MRF+PRF     35.28‡    34.78‡    35.37‡     ‡36.66‡

Table 2: (Mean) average precision achieved by different model configurations
on development topics. Parameterization is consistent with Table 3 except
$k_F = 150$ is used with all feedback runs. Statistical significance is
reported by prefix † and ‡ comparing against cell to left (i.e. less
feedback), while suffix compares PRF & Unigram, MRF & Unigram, and MRF+PRF &
MRF.


Model      Run   $k_F$   $\lambda_F$   $\lambda_T$   $\lambda_O$   $\lambda_P$
Unigram    A2    -       -             -             -             -
           B2    250     0.3           -             -             -
           C2    150     0.45          -             -             -
           D2    150     0.45          -             -             -
           E1    250     0.8           -             -             -
MRF+PRF    A1    -       -             0.8           0.1           0.5
           B1    150     0.3           0.8           0.1           0.75
           C1    150     0.45          0.9           0.05          0.85
           D1    150     0.45          0.9           0.05          0.85

Table 3: Parameterization of submitted runs. MRF+PRF values are identical
for C and D conditions.

For model development, track protocol did not specify which documents to use
for feedback with non-test topics. While it would have been ideal to choose
documents achieving high rank under ad hoc retrieval, mirroring testing
conditions, we simply took feedback documents for each topic according to
their order in the collection assessments. Initially we tried evaluating
cross-validated performance over different choices of feedback documents,
but we ended up abandoning this practice due to time constraints. Since our
RF method made no use of negative feedback, our choice of feedback involved
only relevant documents. For condition D, we always used 5 relevant
documents rather than vary the number per topic as in testing conditions.
Finally, with condition E we simply used all relevant documents under an
assumption that once so many feedback documents were available, the exact
number would make little difference. We did not test this assumption,
however, and so it bears some scrutiny in future work.

Tuning was performed with feedback documents included in evaluation due to a
misinterpretation of track protocol. This led to selection of parameter
settings which likely overfit feedback. Despite the non-optimality of this
tuning process, our development set results presented below do properly
exclude feedback documents and so support useful analysis. Of the 98 topics
originally used in tuning, we discard three which have fewer than five
non-feedback relevant documents, leaving 95 for evaluation. Since condition
E tuning used all relevant documents as feedback, its performance can only
be evaluated with feedback documents included. Consequently, this condition
is largely omitted in our discussion of development set results.

Results  on  Development  Topics 

Parameter values were tuned on development topics via grid search (Metzler &
Croft 2007b), resulting in the values listed in Table 3. Results in Table 2
compare baseline unigram AP with that achieved using PRF, MRF, and MRF+PRF
combined. While results generally show improvement with increasing feedback,
the more interesting observation is seeing how the techniques contribute and
interact with one another in comparison to the baseline and across feedback
conditions.


Model      Run   AP        gmap    rprec   P@10
Unigram    A2    29.18     21.65   35.27   54.32
           B2    ‡30.84    24.22   36.52   ‡57.89
           C2    ‡31.94    26.27   38.14   57.37
           D2    ‡33.49    27.89   39.15   ‡62.42
MRF+PRF    A1    35.28‡    26.42   38.62   60.53‡
           B1    34.78‡    28.33   39.50   61.68‡
           C1    35.37‡    29.88   40.15   61.89‡
           D1    ‡36.66‡   31.42   40.88   ‡64.95

Table 4: Unigram and MRF+PRF results on development topics. Statistical
significance is reported for MAP and P@10 (only) by prefix † and ‡ comparing
against cell above (i.e. less feedback) while suffix compares Unigram vs.
MRF+PRF runs using comparable feedback.

With the sole exception of PRF in condition C, we see PRF and MRF modeling
each yield improvement over the baseline across feedback conditions with MRF
seen to be the stronger of the two. Furthermore, the MRF+PRF combination
achieves additional significant improvement over MRF modeling alone. With
condition E (not shown), neither PRF nor the MRF model improved over the
baseline. However, this result is inconclusive since condition E development
set results could not be evaluated without retrieved feedback documents.

We submitted nine runs for official evaluation: five unigram runs with no
PRF (conditions A-E) and four MRF+PRF runs (conditions A-D). No MRF+PRF run
was submitted for condition E since we did not observe improvement from
either technique on this condition while tuning. Evaluation of these runs on
development topics is shown in Table 4. Results show fairly steady
improvement for unigram runs but a more complicated picture for MRF+PRF
runs. While gmap, rprec, and P@10 steadily improve with increasing feedback,
MAP is flat for A-C. However, both MAP and P@10 show significant improvement
for condition D.
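
For reference, the per-topic metrics reported in these tables can be
sketched as follows. This is an illustrative re-implementation rather than
trec_eval itself, and the small floor applied before taking logarithms in
the geometric mean is an assumption about how zero scores are handled.

```python
import math

def average_precision(ranking, relevant):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k

def r_precision(ranking, relevant):
    """Precision after R documents, R being the number of relevant ones."""
    return precision_at(ranking, relevant, len(relevant)) if relevant else 0.0

def geometric_mean(values, floor=1e-5):
    """Geometric mean across topics, used for gmap."""
    return math.exp(sum(math.log(max(v, floor)) for v in values) / len(values))

ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranking, relevant)
```

Averaging `average_precision` arithmetically over topics gives MAP, while
`geometric_mean` over the same per-topic values gives gmap, which a single
poorly served topic pulls down much more sharply.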

Results  on  Test  Topics 

Official test set results of our nine submitted runs are presented in
Table 5. AP, gmap, rprec, and P@10 metrics are computed on top-1000
retrieved documents with relevance determined by NIST pooling assessment of
31 Terabyte track topics. The pool consisted of the top-10 ranked documents
from each run submitted by a participant. MTC corresponds to Carterette et
al.'s Minimal Test Collections evaluation algorithm (Carterette, Allan, &
Sitaraman 2006) and statAP comes from Aslam and Pavlu's statistical MAP
estimation procedure (Aslam, Pavlu, & Yilmaz 2006); both algorithms were
used in the TREC Million Query Track. Million Query track runs also
contributed to the pools.

Model      Run   AP        gmap    rprec   P@10     MTC    statAP
Unigram    A2    13.43     4.05    16.48   24.19    4.90   22.91
           B2    ‡17.09    6.99    21.09   ‡29.68   6.22   29.07
           C2    ‡19.50    8.66    22.66   32.58    7.03   32.27
           D2    20.64     9.29    23.67   ‡36.45   7.06   32.16
           E1    †24.75    14.85   27.35   ‡48.06   7.32   35.00
MRF+PRF    A1    21.46‡    11.43   25.15   32.90    5.64   27.99
           B1    20.96     11.63   23.56   33.87    6.04   29.59
           C1    ‡22.96‡   13.68   25.75   37.74    7.01   33.87
           D1    ‡24.29‡   14.93   27.42   40.65    7.03   32.16

Table 5: Official results of our runs on test topics. Run name indicates
feedback condition and run ID. Runs are divided between unigram results (no
PRF) and results using both sequential dependency (Metzler & Croft 2005) and
PRF. Statistical significance is reported for MAP and P@10 (only) following
the same conventions used in Table 4.

             MAP               P@10
System       A-E      B-E      A-E      B-E
Brown        22.89    23.23    38.64    40.08
uogRF09      22.08    22.68    38.64    38.87
UAmsR08PD    19.22    20.09    35.17‡   36.78‡
UIUC         18.55‡   20.09‡   32.52‡   35.41‡
FubRF08      17.85‡   19.58‡   32.26‡   35.48‡

Table 6: Relative performance achieved by five of the top systems
participating in the track, as measured by simply averaging official test
topic MAP and P@10 accuracies across the various feedback conditions. Column
"A-E" averages over all conditions, while "B-E" compares feedback conditions
only (no ad hoc "A"). Statistical significance measured by a two-tailed
paired t-test is reported for low significance† (p < .05) and high
significance‡ (p < .01). Refer to track overview (Buckley & Robertson 2008)
and official track results for more detailed comparison.

Unigram results demonstrate a steady improvement in retrieval accuracy
across all but gmap metrics with growing amounts of feedback. The largest AP
improvement is seen moving to condition E's large amount of feedback (4.11%
absolute over condition D). A slightly smaller AP improvement is seen as we
go from ad hoc retrieval (condition A) to condition B's having a single
relevant document: 3.66% (absolute). Similar trending is observed with
high-rank P@10 retrieval: 11.61% and 5.49%, respectively (absolute).
Regarding gmap, it would seem topic drift caused by feedback is seen to hurt
performance, though this loss diminishes as greater feedback reduces drift.
However, note a very different trend is observed on development topics
(Table 4). It may be this difference in trends is simply a byproduct of
differences between how feedback documents were selected for development and
test sets. On the other hand, since official evaluation only included top-10
ranked documents in pooling, assessment may have been biased in favor of
easier topics for which many relevant documents would be seen early in the
ranked list. Finally, since we use identical system configurations for
conditions C and D (which provide comparable feedback), we expected their
results should be quite similar, and MTC and statAP metrics bear this out.

MRF+PRF results are less clear in that condition B results decline in
comparison to ad hoc retrieval under AP and rprec metrics while improving
under all other metrics. This drop is likely due to overfitting. Otherwise
similar trends are observed: we see improvement with increasing feedback. C
and D conditions again appear roughly comparable, with D generally
performing slightly better except in the case of statAP.

Table 6 shows the relative strength of our overall system in comparison to
four other competitive submissions to the 2008 TREC Relevance Feedback
track. Performance is summarized by simply averaging official MAP and P@10
accuracies across the various feedback conditions. Results show our system
typically performed as well as or better than peer systems. The track
overview (Buckley & Robertson 2008) and official track results provide more
thorough details for comparison.

Conclusion 

This paper investigated combination of relevance feedback, pseudo-relevance
feedback, and Markov random field modeling techniques for document
retrieval. Using a large web collection, we evaluated an overall combination
strategy while assessing the contribution from each component in presence of
the others. Given 0-5 feedback documents, we found each component
contributed unique value to the overall ensemble, achieving significant
improvement individually and in combination.

Comparative evaluation in the 2008 TREC Relevance Feedback track further
showed our complete system typically performs as well as or better than
other peer systems. Use of proximity (e.g. features in our MRF model) and/or
PRF was generally seen to help in combination with RF across participating
systems that employed one or the other. Use of negative feedback (e.g. via
Rocchio) generally provided little benefit. Interestingly, all of the
competitive participants' systems displayed some form of non-monotonicity in
accuracy with increasing feedback. While we identified problems with
overfitting in our system, as discussed earlier, it remains to be seen
whether this explanation is sufficient in general.

While our approach to RF in this paper was limited to unigram feedback,
future work will explore term dependency selection from feedback documents
for incorporation into $f_O$ and $f_U$ MRF components (Equation 10).
Previous work has shown little benefit from PRF dependency modeling (Metzler
& Croft 2007a), but RF dependency modeling may prove to be more helpful. We
would also like to explore use of RF in conjunction with supervised unigram
modeling (Bendersky & Croft 2008; Lease, Allan, & Croft 2009).